Abstract

We address the problem of minimizing a con- 
vex function over the space of large matri- 
ces with low rank. While this optimization 
problem is hard in general, we propose an ef- 
ficient greedy algorithm and derive its formal 
approximation guarantees. Each iteration of 
the algorithm involves (approximately) find- 
ing the left and right singular vectors cor- 
responding to the largest singular value of 
a certain matrix, which can be calculated 
in linear time. This leads to an algorithm 
which can scale to large matrices arising in 
several applications such as matrix comple- 
tion for collaborative filtering and robust low 
rank matrix approximation. 



1. Introduction 

Our goal is to approximately solve an optimization 
problem of the form: 

$$\min_{A:\,\mathrm{rank}(A)\le r} R(A), \qquad (1)$$

where $R : \mathbb{R}^{m\times n} \to \mathbb{R}$ is a convex and smooth function.
This problem arises in many machine learning applications
such as collaborative filtering (Koren et al., 2009),
robust low rank matrix approximation (Ke & Kanade, 2005;
Croux & Filzmoser, 1998; A. Baccini & Falguerolles, 1996),
and multiclass classification (Amit et al., 2007). The rank
constraint on $A$ is non-convex and therefore it is generally
NP-hard to solve Equation (1) (this follows from
(Natarajan, 1995; Davis et al., 1997)).

In this paper we describe and analyze an approximation
algorithm for solving Equation (1). Roughly speaking, the
proposed algorithm is based on a simple, yet powerful,
observation: instead of representing a matrix $A$ using
$m \times n$ numbers, we represent it using an infinite
dimensional vector $\lambda$, indexed by all pairs $(u, v)$
taken from the unit spheres of $\mathbb{R}^m$ and
$\mathbb{R}^n$ respectively. In this representation, low rank
corresponds to sparsity of the vector $\lambda$.

Thus, we can reduce the problem given in Equation (1)
to the problem of minimizing a vector function $f(\lambda)$
over the set of sparse vectors, $\|\lambda\|_0 \le r$. Based on
this reduction, we apply a greedy approximation algorithm
for minimizing a convex vector function subject to a
sparsity constraint. At first glance, a direct application of
this reduction seems impossible, since $\lambda$ is an
infinite-dimensional vector, and at each iteration of the
greedy algorithm one needs to search over the infinite set
of the coordinates of $\lambda$. However, we show that this
search problem can be cast as the problem of finding the
leading right and left singular vectors of a certain matrix.

After describing and analyzing the general algorithm, 
we show how to apply it to the problems of matrix 
completion and robust low-rank matrix approxima- 
tion. As a side benefit, our general analysis yields 
a new sample complexity bound for matrix comple- 
tion. We demonstrate the efficacy of our algorithm 
by conducting experiments on large-scale movie rec- 
ommendation data sets. 

1.1. Related work 

As mentioned earlier, the problem defined in Equa- 
tion (1) has many applications, and therefore it was 
studied in various contexts. A popular approach is to 
use the trace norm as a surrogate for the rank (e.g.
(Fazel et al., 2002)). This approach is closely related
to the idea of using the $\ell_1$ norm as a surrogate for
sparsity, because low rank corresponds to sparsity of the
vector of singular values and the trace norm is the $\ell_1$
norm of the vector of singular values. This approach
has been extensively studied, mainly in the context
of collaborative filtering. See for example (Cai et al.,
2008; Candes & Plan, 2010; Candes & Recht, 2009;
Keshavan et al., 2010; Keshavan & Oh, 2009).

While the trace norm encourages low rank solutions, it 
does not always produce sparse solutions. Generalizing 
recent studies in compressed sensing, several papers 
(e.g. (Recht et al., 2007; Cai et al., 2008; Candes & 
Plan, 2010; Candes & Recht, 2009; Recht, to appear)) 
give recovery guarantees for the trace norm approach. 
However, these guarantees rely on rather strong as- 
sumptions (e.g., it is assumed that the data is indeed 
generated by a low rank matrix, that certain inco- 
herence assumptions hold, and for matrix completion 
problems, that the entries are sampled uniformly
at random). In addition, trace norm minimization
often involves semi-definite programming, which
usually does not scale well to large-scale problems. 

In this paper we tackle the rank minimization directly, 
using a greedy selection approach, without relying on 
the trace norm as a convex surrogate. Our approach 
is similar to forward greedy selection approaches for 
optimization with sparsity constraint (e.g. the MP 
(Mallat & Zhang, 1993) and OMP (Pati et al., 2002) 
algorithms), and in particular we extend the fully
corrective forward greedy selection algorithm given in
(Shalev-Shwartz et al., 2010). We also provide formal
guarantees on the competitiveness of our algorithm rel- 
ative to matrices with small trace norm. 

Recently, (Lee & Bresler, 2010) proposed the ADMiRA 
algorithm, which also follows the greedy approach. 
However, the ADMiRA algorithm is different, as in
each step it first chooses $2r$ components and then uses
SVD to revert back to a rank-$r$ matrix. This is more
expensive than our algorithm, which chooses a single
rank-1 matrix at each step. The difference between
the two algorithms is somewhat similar to the difference
between the OMP (Pati et al., 2002) algorithm for
learning sparse vectors and CoSaMP (Needell & Tropp,
2009) or SP (Dai & Milenkovic, 2008). In addition,
the ADMiRA algorithm is specific to the squared loss 
while our algorithm can handle any smooth loss. Fi- 
nally, while ADMiRA comes with elegant performance 
guarantees, these rely on strong assumptions, e.g. that 
the matrix defining the quadratic loss satisfies a rank- 
restricted isometry property. In contrast, our analysis 
only assumes smoothness of the loss function. 

The algorithm we propose is also related to Hazan's
algorithm (Hazan, 2008) for solving PSD problems,
which in turn relies on the Frank-Wolfe algorithm (Frank
& Wolfe, 1956) (see (Clarkson, 2008)), as well
as to the follow-up paper of (Jaggi & Sulovsky, 2010),
which applies Hazan's algorithm for optimizing with
trace-norm constraints. There are several important
differences though. First, we tackle the problem directly
and enforce neither PSDness of the matrix nor
a bounded trace-norm. Second, our algorithm is "fully
corrective", that is, it extracts all the information from
existing components before adding a new component.
These differences between the approaches are analogous
to the difference between the Frank-Wolfe algorithm
and fully corrective greedy selection, for minimizing
over sparse vectors, as discussed in (Shalev-Shwartz
et al., 2010). Finally, while each iteration of both
methods involves approximately finding leading
eigenvectors, in (Hazan, 2008) the quality of approximation
should improve as the algorithm progresses while our
algorithm can always rely on the same constant
approximation factor.

2. The GECO algorithm 

In this section we describe our algorithm, which we
call Greedy Efficient Component Optimization (or
GECO for short). Let $A \in \mathbb{R}^{m\times n}$ be a matrix,
and without loss of generality assume that $m \le n$.
The SVD theorem states that $A$ can be written as
$A = \sum_{i=1}^{m} \lambda_i u_i v_i^T$, where $u_1, \dots, u_m$ are members of
$\mathcal{U} = \{u \in \mathbb{R}^m : \|u\| = 1\}$, $v_1, \dots, v_m$ come from
$\mathcal{V} = \{v \in \mathbb{R}^n : \|v\| = 1\}$, and $\lambda_1, \dots, \lambda_m$ are scalars.
To simplify the presentation, we assume that each real
number is represented using a finite number of bits,
therefore the sets $\mathcal{U}$ and $\mathcal{V}$ are finite sets (see Footnote 1). It follows
that we can also write $A$ as $A = \sum_{(u,v)\in\mathcal{U}\times\mathcal{V}} \lambda_{u,v}\, u v^T$,
where $\lambda \in \mathbb{R}^{|\mathcal{U}\times\mathcal{V}|}$ and we index the elements of $\lambda$ using
pairs $(u, v) \in \mathcal{U}\times\mathcal{V}$. Note that the representation of $A$
using a vector $\lambda$ is not unique, but from the SVD theorem,
there is always a representation of $A$ for which
the number of non-zero elements of $\lambda$ is at most $m$, i.e.
$\|\lambda\|_0 \le m$ where $\|\lambda\|_0 = |\{(u,v) : \lambda_{u,v} \ne 0\}|$. Furthermore,
if $\mathrm{rank}(A) \le r$ then there is a representation of
$A$ using a vector $\lambda$ for which $\|\lambda\|_0 \le r$.

1 This assumption greatly simplifies the presentation but
is not very limiting since we do not impose any restriction
on the number of bits needed to represent a single real
number. We note that the assumption is not necessary
and can be waived by writing $A = \int_{(u,v)\in\mathcal{U}\times\mathcal{V}} u v^T \, d\lambda(u, v)$,
where $\lambda$ is a measure on $\mathcal{U}\times\mathcal{V}$, and from the SVD theorem,
there is always a representation with $\lambda$ which is non-zero
on finitely many points.

Algorithm 1 GECO

  Input: Convex-smooth function $R : \mathbb{R}^{m\times n} \to \mathbb{R}$; rank constraint $r$; tolerance $\tau \in [0, 1/2]$
  Initialize: $U = []$, $V = []$
  for $i = 1, \dots, r$ do
    $(u, v) = \mathrm{ApproxSV}(\nabla R(UV^T), \tau)$
    Set $U = [U, u]$ and $V = [V, v]$
    Set $B = \mathrm{argmin}_{B \in \mathbb{R}^{i\times i}} R(UBV^T)$
    Calculate SVD: $B = PDQ^T$
    Update: $U = UPD$, $V = VQ$
  end for

Given a (sparse) vector $\lambda \in \mathbb{R}^{|\mathcal{U}\times\mathcal{V}|}$ we define the
corresponding matrix to be

$$A(\lambda) = \sum_{(u,v)\in\mathcal{U}\times\mathcal{V}} \lambda_{u,v}\, u v^T .$$

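Note that $A(\lambda)$ is a linear mapping. Given a function
$R : \mathbb{R}^{m\times n} \to \mathbb{R}$, we define a function

$$f(\lambda) = R(A(\lambda)) = R\Big(\sum_{(u,v)\in\mathcal{U}\times\mathcal{V}} \lambda_{u,v}\, u v^T\Big) .$$

It is easy to verify that if $R$ is a convex function over
$\mathbb{R}^{m\times n}$ then $f$ is convex over $\mathbb{R}^{|\mathcal{U}\times\mathcal{V}|}$ (since $f$ is a
composition of $R$ over a linear mapping). We can therefore
reduce the problem given in Equation (1) to the problem

$$\min_{\lambda \in \mathbb{R}^{|\mathcal{U}\times\mathcal{V}|}:\ \|\lambda\|_0 \le r} f(\lambda) . \qquad (2)$$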

While the optimization problem given in Equation (2)
is over an arbitrarily large space, we next show that
a forward greedy selection procedure can be implemented
efficiently. The greedy algorithm starts with
$\lambda = (0, \dots, 0)$. At each iteration, we first find the
vectors $(u, v)$ that maximize the magnitude of the partial
derivative of $f(\lambda)$ with respect to $\lambda_{u,v}$. Assuming that
$R$ is differentiable, and using the chain rule, we obtain:

$$\frac{\partial f(\lambda)}{\partial \lambda_{u,v}} = \langle \nabla R(A(\lambda)),\, u v^T \rangle = u^T \nabla R(A(\lambda))\, v ,$$

where $\nabla R(A(\lambda))$ is the $m \times n$ matrix of partial derivatives
of $R$ with respect to the elements of $A(\lambda)$. The
vectors $u, v$ that maximize the magnitude of the above
expression are the left and right singular vectors
corresponding to the maximal singular value of $\nabla R(A(\lambda))$.
Therefore, even though the number of elements in
$\mathcal{U}\times\mathcal{V}$ is very large, we can still perform a greedy
selection of one pair $(u, v) \in \mathcal{U}\times\mathcal{V}$ in an efficient way.
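The claim that the leading singular vectors maximize $u^T G v$ over unit vectors can be checked numerically. The following is a minimal sketch, assuming NumPy; the matrix `G` is a random stand-in for the gradient matrix, and the helper name `leading_singular_pair` is illustrative rather than taken from the paper.

```python
import numpy as np

def leading_singular_pair(G):
    """Return (u, v, sigma): unit vectors attaining max u^T G v, via a full SVD."""
    P, s, Qt = np.linalg.svd(G, full_matrices=False)
    return P[:, 0], Qt[0, :], s[0]

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 7))          # stand-in for the gradient matrix
u, v, sigma = leading_singular_pair(G)

# No random pair of unit vectors should beat the leading singular pair.
best_random = 0.0
for _ in range(1000):
    p = rng.standard_normal(5)
    p /= np.linalg.norm(p)
    q = rng.standard_normal(7)
    q /= np.linalg.norm(q)
    best_random = max(best_random, abs(p @ G @ q))
```

A full SVD is used here only for clarity; at scale one would compute just the leading pair.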

In some situations, even the calculation of the leading
singular vectors might be too expensive. We therefore
allow approximate maximization, and denote by
$\mathrm{ApproxSV}(\nabla R(A(\lambda)), \tau)$ a procedure (see Footnote 2) which returns
vectors for which

$$u^T \nabla R(A(\lambda))\, v \ge (1 - \tau) \max_{p,q}\; p^T \nabla R(A(\lambda))\, q .$$
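One possible way to realize such a procedure is plain power iteration. The sketch below assumes NumPy and a non-zero input matrix; the function name `approx_sv` and the parameters `n_iter` and `seed` are illustrative choices, and the code makes no attempt to certify the $(1 - \tau)$ factor for a given iteration count.

```python
import numpy as np

def approx_sv(G, n_iter=30, seed=0):
    """Power iteration on G^T G; returns unit vectors (u, v) with u^T G v >= 0.

    Assumes G is non-zero (otherwise the normalizations divide by zero).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    # Final left vector for the current v, so that u^T G v = ||G v|| >= 0.
    u = G @ v
    u /= np.linalg.norm(u)
    return u, v
```

Each iteration costs two matrix-vector products, so with a sparse gradient the per-iteration cost is proportional to the number of non-zero entries.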



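Let $U$ and $V$ be matrices whose columns contain the
vectors $u$ and $v$ we aggregated so far. The second
step of each iteration of the algorithm sets $\lambda$ to be the
solution of the following optimization problem:

$$\min_{\lambda \in \mathbb{R}^{|\mathcal{U}\times\mathcal{V}|}} f(\lambda) \quad \text{s.t.} \quad \mathrm{supp}(\lambda) \subseteq \mathrm{span}(U) \times \mathrm{span}(V), \qquad (3)$$

where $\mathrm{supp}(\lambda) = \{(u,v) : \lambda_{u,v} \ne 0\}$, and
$\mathrm{span}(U)$, $\mathrm{span}(V)$ are the linear spans of the columns
of $U$, $V$ respectively.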

We now describe how to solve Equation (3). Let $s$
be the number of columns of $U$ and $V$. Note that
any vector $u \in \mathrm{span}(U)$ can be written as $U b_u$, where
$b_u \in \mathbb{R}^s$, and similarly, any $v \in \mathrm{span}(V)$ can be written
as $V b_v$. Therefore, if the support of $\lambda$ is in
$\mathrm{span}(U) \times \mathrm{span}(V)$ we have that $A(\lambda)$ can be written as

$$A(\lambda) = \sum_{(u,v)\in\mathrm{supp}(\lambda)} \lambda_{u,v}\, (U b_u)(V b_v)^T = U \Big( \sum_{(u,v)\in\mathrm{supp}(\lambda)} \lambda_{u,v}\, b_u b_v^T \Big) V^T .$$

Thus, any $\lambda$ whose support is in $\mathrm{span}(U) \times \mathrm{span}(V)$
yields a matrix $B(\lambda) = \sum_{u,v} \lambda_{u,v}\, b_u b_v^T$. The SVD
theorem tells us that the opposite direction is also true,
namely, for any $B \in \mathbb{R}^{s\times s}$ there exists $\lambda$ whose
support is in $\mathrm{span}(U) \times \mathrm{span}(V)$ that generates $B$ (and
also $UBV^T$). Denote $\tilde{R}(B) = R(UBV^T)$; it follows
that Equation (3) is equivalent to the following
unconstrained optimization problem: $\min_{B \in \mathbb{R}^{s\times s}} \tilde{R}(B)$. It is
easy to verify that $\tilde{R}$ is a convex function, and
therefore can be minimized efficiently. Once we obtain the
matrix $B$ that minimizes $\tilde{R}(B)$ we can use its SVD to
generate the corresponding $\lambda$.

In practice, we do not need to maintain $\lambda$ at all, but
only to maintain matrices $U, V$ such that $A(\lambda) = UV^T$.
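The bookkeeping behind this remark is that the update $U \leftarrow UPD$, $V \leftarrow VQ$, with $B = PDQ^T$, absorbs $B$ into the factors without changing the represented matrix. A small numerical check, assuming NumPy and with random matrices standing in for the actual iterates:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 6, 8, 3
U = rng.standard_normal((m, s))
V = rng.standard_normal((n, s))
B = rng.standard_normal((s, s))    # stand-in for the minimizer of R(U B V^T)

P, d, Qt = np.linalg.svd(B)        # B = P @ diag(d) @ Qt
U_new = U @ P @ np.diag(d)         # U <- U P D
V_new = V @ Qt.T                   # V <- V Q

A_before = U @ B @ V.T             # matrix represented before the update
A_after = U_new @ V_new.T          # matrix represented after the update
```

After the update the pair `(U_new, V_new)` alone encodes the current solution, so $B$ does not need to be stored between iterations.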



2 An example of such a procedure is the power iteration
method, which can implement ApproxSV in time
$O(N \log(n)/\tau)$, where $N$ is the number of non-zero
elements of $\nabla R(A(\lambda))$. See Theorem 3.1 in (Kuczyński &
Woźniakowski, 1992). Our analysis shows that the value
of $\tau$ has a mild effect on the convergence of GECO, and
one can even choose a constant value like $\tau = 1/2$. This is
in contrast to (Hazan, 2008; Jaggi & Sulovsky, 2010) which
require the approximation parameter to decrease when the
rank increases. Note also that the ApproxEV procedure
described in (Hazan, 2008; Jaggi & Sulovsky, 2010) requires
an additive approximation, while we require a multiplicative
approximation.

A summary of the pseudo-code is given in Algorithm
1. The runtime of the algorithm is as follows. Step 4
can be performed in time $O(N \log(n)/\tau)$, where $N$ is
the number of non-zero elements of $\nabla R(UV^T)$, using
the power method (see Footnote 2). Since our analysis
(given in Section 3) allows $\tau$ to be a constant (e.g.
1/2), this means that the runtime is $O(N \log(n))$. The
runtime of Step 6 depends on the structure of the function
$R$. We specify it when describing specific applications
of GECO in later sections. Finally, the runtime
of Step 7 is at most $r^3$, and Step 8 takes $O(r^2(m + n))$.

2.1. Variants of GECO 

2.1.1. How to choose (u, v)

GECO chooses $(u, v)$ to be the leading singular vectors,
which are the maximizers of $u^T \nabla R(A)\, v$ over the unit
spheres of $\mathbb{R}^m$ and $\mathbb{R}^n$. Our analysis in the next section
guarantees that this choice yields a sufficient decrease
of the objective function. However, there may be a pair
$(u, v)$ which leads to an even larger decrease in the
objective value. Choosing such a direction can lead to
improved performance. We note that our analysis in
the next section still holds, as long as the direction
we choose leads to a larger decrease in the objective
value, relative to the decrease we can get from using
the leading singular vectors. In Section 6 we describe
a method that finds better directions.

2.1.2. Additional replacement steps 

Each iteration of GECO increases the rank by 1. In
many cases, it is possible to decrease the objective by
replacing one of the components without increasing
the rank. If we verify that this replacement step indeed
decreases the objective (by simply evaluating the
objective before and after the change), then the analysis
we present in the next section remains valid. We now
describe a simple way to perform a replacement. We
start by finding a candidate pair $(u, v)$ and performing
Steps 5-7 of GECO. Then, we approximate the matrix
$B$ by zeroing its smallest singular value. Let $\tilde{B}$ denote
this approximation. We next check if $R(U \tilde{B} V^T)$ is
strictly smaller than the previous objective value. If
yes, we update $U, V$ based on $\tilde{B}$ and obtain that the
rank of $UV^T$ has not been increased while the objective
has been decreased. Otherwise, we update $U, V$
based on $B$, thus increasing the rank, but our analysis
tells us that we are guaranteed to sufficiently decrease
the objective. If we restrict the algorithm to perform
at most $O(1)$ attempted replacement steps between
each rank-increasing iteration, then its runtime
guarantee is only increased by an $O(1)$ factor, and all the
convergence guarantees remain valid.
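The replacement test described above can be sketched as follows, assuming NumPy. The function names and the callable `R` (which evaluates the objective on a full matrix) are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def drop_smallest_singular_value(B):
    """Return B with its smallest singular value set to zero."""
    P, d, Qt = np.linalg.svd(B)
    d = d.copy()
    d[-1] = 0.0                      # NumPy returns singular values in descending order
    return P @ np.diag(d) @ Qt

def replacement_step(R, U, V, B, prev_value):
    """Accept the truncated B only if it strictly lowers the objective.

    Returns (B_to_use, replaced), where `replaced` is True when the rank
    did not increase.
    """
    B_trunc = drop_smallest_singular_value(B)
    if R(U @ B_trunc @ V.T) < prev_value:
        return B_trunc, True         # rank unchanged, objective decreased
    return B, False                  # fall back to the rank-increasing update
```

The comparison uses a single extra evaluation of the objective, which is what keeps the cost of an attempted replacement within a constant factor of a regular iteration.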



2.1.3. Adding Schatten norm regularization 

In some situations, a rank constraint is not enough for
obtaining good generalization guarantees and one can
consider objective functions $R(A)$ which contain additional
regularization of the form $h(\lambda(A))$, where $\lambda(A)$
is the vector of singular values of $A$ and $h$ is a vector
function such as $h(x) = \|x\|_p$. For example, if
$p = 2$, this regularization term is equivalent to Frobenius
norm regularization of $A$. In general, adding a
convex regularization term should not pose any problem.
A simple trick to do this is to orthonormalize the
columns of $U$ and $V$ before Step 6. Then, for any
$B$, the singular values of $B$ equal the singular values
of $UBV^T$. Thus, we can solve the problem in Step
6 more efficiently while regularizing $B$ instead of the
larger matrix $UBV^T$.
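The orthonormalization trick can be illustrated with a QR factorization. This is a sketch assuming NumPy, with random matrices standing in for the actual $U$, $V$ and $B$; the paper does not prescribe QR specifically.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((10, 3))
V = rng.standard_normal((12, 3))
B = rng.standard_normal((3, 3))

Qu, Ru = np.linalg.qr(U)         # U = Qu @ Ru, Qu has orthonormal columns
Qv, Rv = np.linalg.qr(V)         # V = Qv @ Rv

# The same matrix in orthonormal bases: U B V^T = Qu (Ru B Rv^T) Qv^T.
B_orth = Ru @ B @ Rv.T

sv_small = np.linalg.svd(B_orth, compute_uv=False)
sv_large = np.linalg.svd(Qu @ B_orth @ Qv.T, compute_uv=False)[:3]
```

With orthonormal factors, a regularizer on the singular values of the large $m \times n$ matrix can be evaluated on the small $s \times s$ matrix instead.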

2.1.4. Optimizing over diagonal matrices B 

Step 6 of GECO involves solving a problem with $i^2$
variables, where $i \in \{1, \dots, r\}$. When $r$ is small this
is a reasonable computational effort. However, when
$r$ is large, Steps 6-7 can be expensive. For example,
in matrix completion problems, the complexity of
Step 6 can scale with $r^6$. If runtime is important, it
is possible to restrict $B$ to be a diagonal matrix, or
in other words, we only optimize over the coefficients
of $\lambda$ corresponding to $U$ and $V$ without changing the
support of $\lambda$. Thus, in Step 6 we solve a problem with
$i$ variables, and Step 7 is not needed. It is possible to
verify that the analysis we give in the next section still
holds for this variant.

3. Analysis 

In this section we give a competitive analysis for
GECO. The first theorem shows that after performing
$r$ iterations of GECO, its solution is not much
worse than that of any matrix $\bar{A}$ whose trace
norm (see Footnote 3) is bounded by a function of $r$. The second
theorem shows that with additional assumptions, we can
be competitive with matrices whose rank is at most
$r$. The proofs can be found in the long version of this
paper.

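To formally state the theorems we first need to define
a smoothness property of the function $f$.

3 The trace norm of a matrix is the sum of its singular
values.

Definition 1 (smoothness) We say that $f$ is $\beta$-smooth
if for any $\lambda$ and $(u,v) \in \mathcal{U}\times\mathcal{V}$ we have, for all $\eta \in \mathbb{R}$,

$$f(\lambda + \eta e^{u,v}) \le f(\lambda) + \eta \nabla_{u,v} f(\lambda) + \frac{\beta \eta^2}{2} ,$$

where $\nabla_{u,v} f(\lambda)$ is the partial derivative of $f$ with respect to
coordinate $(u,v)$ and $e^{u,v}$ is the all zeros vector except 1 in the
coordinate corresponding to $(u,v)$. We say that $R$ is $\beta$-smooth
if the function $f(\lambda) = R(A(\lambda))$ is $\beta$-smooth.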

Theorem 1 Fix some $\epsilon > 0$. Assume that GECO
(or one of its variants) is run with a $\beta$-smooth function
$R$, a rank constraint $r$, and a tolerance parameter
$\tau \in [0, 1)$. Let $A$ be its output matrix. Then, for all
matrices $\bar{A}$ with

$$\|\bar{A}\|_{tr}^2 \le \frac{\epsilon (r + 1)(1 - \tau)^2}{2\beta}$$

we have that $R(A) \le R(\bar{A}) + \epsilon$.

The previous theorem shows competitiveness with matrices
of low trace norm. Our second theorem shows
that with additional assumptions on the function $f$ we
can be competitive with matrices of low rank as well.
We need the following definition.

Definition 2 (strong convexity) Let $I \subseteq \mathcal{U}\times\mathcal{V}$.
We say that $f$ is $\sigma$-strongly-convex over $I$ if for any
$\lambda_1, \lambda_2$ whose support (see Footnote 4) is in $I$ we have

$$f(\lambda_1) - f(\lambda_2) - \langle \nabla f(\lambda_2), \lambda_1 - \lambda_2 \rangle \ge \frac{\sigma}{2} \|\lambda_1 - \lambda_2\|_2^2 .$$

We say that $R$ is $\sigma$-strongly-convex over $I$ if the function
$f(\lambda) = R(A(\lambda))$ is $\sigma$-strongly-convex over $I$.

Theorem 2 Assume that the conditions of Theorem 1
hold. Then, for any $\bar{A}$ such that

$$\mathrm{rank}(\bar{A}) \le \frac{\epsilon (r + 1)(1 - \tau)^2 \sigma}{4 \beta R(0)}$$

and such that $R$ is $\sigma$-strongly-convex over the singular
vectors of $\bar{A}$, we have that $R(A) \le R(\bar{A}) + \epsilon$.

We discuss the implications of these theorems for sev- 
eral applications in the next sections. 

4. Application I: Matrix Completion 

Matrix completion is the problem of predicting the
entries of some unknown target matrix $Y \in \mathbb{R}^{m\times n}$
based on a random subset of observed entries,
$E \subseteq [m] \times [n]$. For example, in the famous Netflix problem,
$m$ represents the number of users, $n$ represents the
number of movies, and $Y_{i,j}$ is the rating user $i$ gives to
movie $j$. One approach for learning the matrix $Y$ is
to find a matrix $A$ of low rank which approximately
agrees with $Y$ on the entries of $E$ (in mean squared
error terms). Using the notation of this paper, we
would like to minimize the objective

$$R(A) = \frac{1}{|E|} \sum_{(i,j)\in E} (A_{i,j} - Y_{i,j})^2$$

over low rank matrices $A$.



We now specify GECO for this objective function. It
is easy to verify that the $(i, j)$ element of $\nabla R(A)$ is
$\frac{2}{|E|}(A_{i,j} - Y_{i,j})$ if $(i,j) \in E$ and $0$ otherwise. The
number of non-zero elements of $\nabla R(A)$ is at most
$|E|$, and therefore Step 4 of GECO can be implemented
using the power method in time $O(|E| \log(n))$.
Given matrices $U, V$, let $u_i$ be the $i$'th row of $U$
and $v_j$ be the $j$'th row of $V$. We have that the
$(i, j)$ element of the matrix $UBV^T$ can be written as
$\langle \mathrm{vec}(u_i^T v_j), \mathrm{vec}(B) \rangle$, where vec of a matrix is the
vector obtained by taking all the elements of the matrix
column wise. We can therefore rewrite $R(UBV^T)$ as
$\frac{1}{|E|} \sum_{(i,j)\in E} \big( \langle \mathrm{vec}(u_i^T v_j), \mathrm{vec}(B) \rangle - Y_{i,j} \big)^2$, which makes
Step 6 of GECO a vanilla least squares problem over at
most $r^2$ variables. The runtime of this step is therefore
bounded by $O(r^6 + |E| r^2)$.
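Putting the pieces of this section together, the following is a minimal sketch of GECO for the squared-loss matrix completion objective, assuming NumPy. It is not the authors' implementation: the function name and its arguments are illustrative, the gradient is stored densely for readability (a sparse matrix would be used at scale), and $\mathrm{vec}$ is taken row-wise rather than column-wise, which is equivalent as long as it is applied consistently.

```python
import numpy as np

def geco_matrix_completion(rows, cols, vals, shape, r, n_power=30, seed=0):
    """Sketch of GECO for R(A) = (1/|E|) * sum_{(i,j) in E} (A_ij - Y_ij)^2.

    rows, cols, vals: indices and values of the observed entries E.
    Returns factors (U, V) such that the estimate is U @ V.T.
    """
    m, n = shape
    rng = np.random.default_rng(seed)
    U = np.zeros((m, 0))
    V = np.zeros((n, 0))
    for _ in range(r):
        # Gradient: (2/|E|)(A_ij - Y_ij) on observed entries, 0 elsewhere.
        pred = np.sum(U[rows] * V[cols], axis=1)
        G = np.zeros(shape)
        np.add.at(G, (rows, cols), 2.0 * (pred - vals) / len(vals))
        if not G.any():
            break                      # training loss is already zero
        # Step 4: approximate leading singular pair by power iteration.
        v = rng.standard_normal(n)
        for _ in range(n_power):
            u = G @ v
            u /= np.linalg.norm(u)
            v = G.T @ u
            v /= np.linalg.norm(v)
        # Step 5: append the new pair.
        U = np.column_stack([U, u])
        V = np.column_stack([V, v])
        # Step 6: least squares over B; row k of X is vec(u_i^T v_j)
        # (row-major) for the k-th observed entry (i, j).
        s = U.shape[1]
        X = (U[rows][:, :, None] * V[cols][:, None, :]).reshape(len(vals), s * s)
        b, *_ = np.linalg.lstsq(X, vals, rcond=None)
        B = b.reshape(s, s)
        # Steps 7-8: absorb B into the factors via its SVD.
        P, d, Qt = np.linalg.svd(B)
        U = U @ P @ np.diag(d)
        V = V @ Qt.T
    return U, V
```

A typical call passes the coordinates and ratings of the observed entries, for example `geco_matrix_completion(rows, cols, vals, (m, n), r=10)`, and predicts entry $(i, j)$ as `U[i] @ V[j]`.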

4.1. Analysis 

To apply our analysis for matrix completion we first 
bound the smoothness parameter. 

Lemma 1 For matrix completion the smoothness parameter
is at most $2/|E|$.

Proof For any $u, v$ and $i, j$ we can rewrite
$(A_{i,j} + \eta u_i v_j - Y_{i,j})^2$ as

$$(A_{i,j} - Y_{i,j})^2 + 2 (A_{i,j} - Y_{i,j})\, \eta u_i v_j + \eta^2 u_i^2 v_j^2 .$$

Taking expectation over $(i,j) \in E$ we obtain:

$$f(\lambda + \eta e^{u,v}) \le f(\lambda) + \eta \nabla_{u,v} f(\lambda) + \eta^2 \frac{1}{|E|} \sum_{(i,j)\in E} u_i^2 v_j^2 .$$

Since $\sum_{(i,j)\in E} u_i^2 v_j^2 \le \sum_i u_i^2 \sum_j v_j^2 = 1$, the proof
follows.

4 The support of $\lambda$ is the set of $(u, v)$ for which $\lambda_{u,v} \ne 0$.



Our general analysis therefore implies that for any
$\bar{A}$, GECO can find a matrix $A$ with rank
$r \le O(\|\bar{A}\|_{tr}^2 / (\epsilon |E|))$, such that $R(A) \le R(\bar{A}) + \epsilon$.

Let us now discuss the implications of this result for
the number of observed entries required for predicting
all the entries of $Y$. Suppose that the entries $E$
are sampled i.i.d. from some unknown distribution
$D \in \mathbb{R}^{m\times n}$, with $D_{i,j} \ge 0$ for all $i, j$ and $\sum_{i,j} D_{i,j} = 1$.
Denote the generalization error of a matrix $A$ by

$$F(A) = \sum_{i,j} D_{i,j} (A_{i,j} - Y_{i,j})^2 .$$

Using generalization bounds for low rank matrices (e.g.
(Srebro et al., 2005)), it is possible to show that for
any matrix $A$ of rank at most $r$ we have that with
high probability (see Footnote 5)

$$|F(A) - R(A)| \le \tilde{O}\Big(\sqrt{r(m + n)/|E|}\Big) .$$

Combining this with our analysis for GECO, and op- 
timizing e, it is easy to derive the following: 

Corollary 1 Fix some matrix $\bar{A}$. Then, GECO can
find a matrix $A$ such that with high probability over the
choice of the entries in $E$

Without loss of generality assume that $m \le n$. It
follows that if $\|\bar{A}\|_{tr}$ is of order $\sqrt{mn}$ then order of $n^{3/2}$
entries suffice to learn the matrix $Y$. This matches
recent learning-theoretic guarantees for distribution-free
learning with the trace norm (Shalev-Shwartz &
Shamir, 2011).
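One way to see where the $n^{3/2}$ rate can come from is to combine the two bounds stated earlier in this section and balance $\epsilon$. The following is a rough sketch that ignores constants and logarithmic factors; it is an illustration under those assumptions and not the statement of Corollary 1.

```latex
% GECO with rank r ~ \|\bar{A}\|_{tr}^2 / (\epsilon |E|) gives R(A) \le R(\bar{A}) + \epsilon,
% and the generalization gap for rank-r matrices is of order \sqrt{r(m+n)/|E|}. Hence
\[
  F(A) \;\lesssim\; R(\bar{A}) + \epsilon
        + \sqrt{\frac{\|\bar{A}\|_{tr}^2\,(m+n)}{\epsilon\,|E|^2}} .
\]
% The two \epsilon-dependent terms balance when
\[
  \epsilon^{3/2} = \frac{\|\bar{A}\|_{tr}\sqrt{m+n}}{|E|}
  \quad\Longleftrightarrow\quad
  \epsilon = \left(\frac{\|\bar{A}\|_{tr}^2\,(m+n)}{|E|^2}\right)^{1/3} .
\]
% With \|\bar{A}\|_{tr}^2 of order mn and m \le n, we have mn(m+n) \le 2n^3, so
\[
  \epsilon \;\lesssim\; \left(\frac{2n^3}{|E|^2}\right)^{1/3},
\]
% which is small once |E| is large compared with n^{3/2}.
```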

5. Application II: Robust Low Rank 
Matrix Approximation 

A very common problem in data analysis is finding a
low-rank matrix $A$ which approximates a given matrix
$Y$, namely solving $\min_{A:\,\mathrm{rank}(A)\le r} d(A, Y)$, where $d$ is
some discrepancy measure. For simplicity, assume that
$Y \in \mathbb{R}^{n\times n}$. When $d(A, Y)$ is the normalized Frobenius
norm $d(A, Y) = \frac{1}{n^2} \sum_{i,j} (A_{i,j} - Y_{i,j})^2$, this problem can
be solved efficiently via SVD. However, due to the use
of the Frobenius norm, this procedure is well-known
to be sensitive to outliers.

5 To be more precise, this bound requires that the elements
of $A$ are bounded by a constant. But, since we can
assume that the elements of $Y$ are bounded by a constant,
it is always possible to clip the elements of $A$ to the range
of the elements of $Y$ without increasing $F(A)$.

One way to make the procedure more robust is to replace
the Frobenius norm by a less sensitive norm,
such as the $\ell_1$ norm $d(A, Y) = \frac{1}{n^2} \sum_{i,j} |A_{i,j} - Y_{i,j}|$ (see
for instance (A. Baccini & Falguerolles, 1996), (Croux
& Filzmoser, 1998), (Ke & Kanade, 2005)). Unfortunately,
there are no known efficient algorithms to
obtain the global optimum of this objective function,
subject to a rank constraint on $A$. However, using
our proposed algorithm, we can efficiently find a low-rank
matrix which approximately minimizes $d(A, Y)$.
In particular, we can apply it to any convex discrepancy
measure $d$, including robust ones such as the $\ell_1$
norm. The only technicality is that our algorithm requires
$d$ to be smooth, which is not true in the case of
the $\ell_1$ norm. However, this can be easily alleviated by
working with smoothed versions of the $\ell_1$ norm, which
replace the absolute value by a smooth approximation.
One example is the Huber loss, defined as $L(x) = x^2/2$
for $|x| \le 1$, and $L(x) = |x| - 1/2$ otherwise.
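The Huber loss and the resulting discrepancy are straightforward to write down. Below is a short sketch assuming NumPy; the function names are illustrative, and the discrepancy is averaged over all entries so that it works for any matrix shape.

```python
import numpy as np

def huber(x):
    """L(x) = x^2/2 for |x| <= 1 and |x| - 1/2 otherwise (elementwise)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, 0.5 * x**2, np.abs(x) - 0.5)

def huber_grad(x):
    """Derivative L'(x): x inside [-1, 1], sign(x) outside."""
    return np.clip(np.asarray(x, dtype=float), -1.0, 1.0)

def robust_discrepancy(A, Y):
    """Average Huber loss over all entries of A - Y."""
    return float(huber(A - Y).mean())

def robust_gradient(A, Y):
    """Gradient of robust_discrepancy with respect to A."""
    return huber_grad(A - Y) / A.size
```

Because the derivative is clipped to $[-1, 1]$, a single large outlier in $Y$ contributes a bounded amount to the gradient, unlike in the squared-loss case.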

Lemma 2 The smoothness parameter of $d(A, Y) =
\frac{1}{n^2} \sum_{i,j} L(A_{i,j} - Y_{i,j})$, where $L$ is the Huber loss, is at
most $1/n^2$.

Proof It is easy to verify that the smoothness parameter
of $L(x)$ is 1, since $L(x)$ is upper bounded by the
parabola $x^2/2$, whose smoothness parameter is exactly
1. Therefore,

$$L(A_{i,j} + \eta u_i v_j - Y_{i,j}) \le L(A_{i,j} - Y_{i,j}) + \eta L'(A_{i,j} - Y_{i,j})\, u_i v_j + \frac{\eta^2}{2} u_i^2 v_j^2 .$$

Taking the average over all entries, this implies that

$$f(\lambda + \eta e^{u,v}) \le f(\lambda) + \eta \nabla_{u,v} f(\lambda) + \frac{\eta^2}{2 n^2} \sum_{i,j} u_i^2 v_j^2 .$$

Since the last term is at most $\eta^2/(2n^2)$, the result follows.
■

We therefore obtain: 

Corollary 2 Let $d(A, Y)$ be the Huber loss discrepancy
as defined in Lemma 2. Then, for any matrix $\bar{A}$,
GECO can find a matrix $A$ with $d(A, Y) \le d(\bar{A}, Y) + \epsilon$
and $\mathrm{rank}(A) = O\big(\|\bar{A}\|_{tr}^2 / (n^2 \epsilon)\big)$.

6. Experiments

We evaluated GECO for the problem of matrix
completion by conducting experiments on three
standard collaborative filtering datasets: MovieLens100K,
MovieLens1M, and MovieLens10M (see Footnote 6). The
different datasets contain $10^5$, $10^6$, $10^7$ ratings of
943, 6040, 69878 users on 1682, 3706, 10677 movies,
respectively. All the ratings are integers between 1 and 5. We
partitioned each data set into training and testing sets
as done in (Jaggi & Sulovsky, 2010).

6 Available through www.grouplens.org

Figure 1. Root Mean Squared Error on the test set as a function of the rank. The horizontal line corresponds to the
minimal error achieved by JS. Left: MovieLens100K, Middle: MovieLens1M, Right: MovieLens10M.

We implemented GECO while applying two of the variants
described in Section 2.1, as we explain in detail
below. The first variant (see Section 2.1.1) tries to find
update vectors $(u', v')$ which lead to a larger decrease
of the objective function relative to the leading singular
vectors $(u, v)$ of the gradient matrix $\nabla R(A)$. Inspired
by the proof of Theorem 1, we observe that the
decrease of the objective function inversely depends on
the smoothness of the scalar function $R(A + \eta u v^T)$. We
therefore would like to find a pair which on one hand
has a large correlation with $\nabla R(A)$ and on the other
hand yields a smooth scalar function $R(A + \eta u v^T)$. The
smoothness of $R(A + \eta u v^T)$ is analyzed in Lemma 1
and is shown to be at most $\frac{2}{|E|}$. Examining the proof
lines more carefully, we see that for balanced vectors,
i.e. $u_i = \pm\frac{1}{\sqrt{m}}$, $v_j = \pm\frac{1}{\sqrt{n}}$, we obtain a lower
smoothness parameter of $\frac{2}{mn}$. Thus, a possible good update
direction is to choose $u, v$ that maximize $u^T \nabla R(A)\, v$
over vectors of the form $u_i = \pm\frac{1}{\sqrt{m}}$, $v_j = \pm\frac{1}{\sqrt{n}}$. This
is equivalent to maximizing $u^T \nabla R(A)\, v$ over the (scaled) $\ell_\infty$
balls of $\mathbb{R}^m$ and $\mathbb{R}^n$, which is unfortunately known to
be NP-hard. Nevertheless, a simple alternate maximization
approach is easy to implement and often
works well. That is, fixing some $u$, we can see that
$v = \mathrm{sign}(u^T \nabla R(A)) / \sqrt{n}$ maximizes the objective, and
similarly, fixing $v$ we have that $u = \mathrm{sign}(\nabla R(A)\, v) / \sqrt{m}$
is optimal. We therefore implement this alternate
maximization at each step and find a candidate pair
$(u', v')$. As described in Section 2.1.1, we compare
the decrease of loss as obtained by the leading
singular vectors, $(u, v)$, and the candidate pair mentioned
previously, $(u', v')$, and update using the pair
which leads to a larger decrease of the objective. We
remind the reader that although $(u', v')$ are obtained
heuristically, our implementation is still provably correct
and our guarantees from Section 3 still hold.
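The alternate maximization over balanced sign vectors can be sketched in a few lines, assuming NumPy. The function name, the number of rounds `n_rounds`, and the random initialization are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def balanced_pair(G, n_rounds=10, seed=0):
    """Alternating maximization of u^T G v over u_i = ±1/sqrt(m), v_j = ±1/sqrt(n)."""
    m, n = G.shape
    rng = np.random.default_rng(seed)
    u = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
    for _ in range(n_rounds):
        # For fixed u the best v matches the signs of G^T u (ties go to +1).
        v = np.where(G.T @ u >= 0, 1.0, -1.0) / np.sqrt(n)
        # For fixed v the best u matches the signs of G v.
        u = np.where(G @ v >= 0, 1.0, -1.0) / np.sqrt(m)
    return u, v
```

Each half-step solves its subproblem exactly, so the value $u^T G v$ never decreases from one round to the next, although the procedure may stop at a local optimum.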

In addition we performed the additional replacement
steps as described in Section 2.1.2. For that purpose,
let $q$ be the number of times we try to perform additional
replacement steps for each rank. Each replacement
attempt is done using the alternate maximization
procedure described previously. After utilizing $q$
attempts of additional replacement steps, we force an
increase of the rank. In our experiments, we set $q = 20$.
Finally, we implemented the ApproxSV procedure using
30 iterations of the power iteration method.

We compared GECO to a state-of-the-art method, re- 
cently proposed in (Jaggi & Sulovsky, 2010), which we 
denote as the JS algorithm. JS, similarly to GECO, 
iteratively increases the rank by computing a direction 
that maximizes some objective function and perform- 
ing a step in that direction. See more details in Sec- 
tion 1.1. In Figure 1, we plot the root mean squared 
error (RMSE) on the test set as a function of the rank. 
As can be seen, GECO decreases the error much faster 
than the JS algorithm. This is expected — see again 
the discussion in Section 1.1. We observe that GECO 
achieves slightly larger test error on the small data set, 
slightly smaller test error on the medium data set, and 
the same error on the large data set. On the small data 
set, GECO starts to overfit when the rank increases 
beyond 4. The JS algorithm avoids this overfitting by 
constraining the trace-norm, but also starts overfit- 
ting after around 30 iterations. On the other hand, on 
the medium data, the trace-norm constraint employed 
by the JS algorithm yields a higher estimation error, 
and GECO, which does not constrain the trace-norm,
achieves a smaller error. In any case, GECO achieves 
very good results while using a rank of at most 10. 

7. Discussion 

GECO is an efficient greedy approach for minimizing 
a convex function subject to a rank constraint. One 
of the main advantages of GECO is that each of its 
iterations involves running a few (precisely, $O(\log(n))$)
iterations of the power method, and therefore GECO 
scales to large matrices. In future work we intend to 



Large-Scale Convex Minimization with a Low-Rank Constraint 



apply GECO to additional applications such as mul- 
ticlass classification and learning fast quadratic classi- 
fiers. 

A. Proofs

A.1. Proof of Theorem 1

To prove the theorem we need the following key 
lemma, which generalizes a result given in (Shalev- 
Shwartz et al., 2010). 

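Lemma 3 Assume that $f$ is $\beta$-smooth. Let $I, \bar{I}$ be
two subsets of $\mathcal{U}\times\mathcal{V}$. Let $\lambda$ be a minimizer of $f$
over all vectors with support in $I$ and let $\bar\lambda$ be a vector
supported on $\bar{I}$. Assume that $f(\lambda) > f(\bar\lambda)$, denote
$s = \|\bar\lambda\|_1$, and let $\tau \in [0, 1)$. Let $(u, v) =
\mathrm{ApproxSV}(\nabla R(A(\lambda)), \tau)$. Then, there exists $\eta$ such
that

$$f(\lambda) - f(\lambda + \eta e^{u,v}) \ge \frac{(f(\lambda) - f(\bar\lambda))^2 (1 - \tau)^2}{2 \beta s^2} .$$

Proof

Without loss of generality assume that $\bar\lambda \ge 0$ (if
$\bar\lambda_{p,q} < 0$ for some $(p, q)$ we can set $\bar\lambda_{-p,q} = -\bar\lambda_{p,q}$
and $\bar\lambda_{p,q} = 0$ without affecting the objective) and assume
that $u^T \nabla R(A(\lambda))\, v \le 0$ (if this does not hold,
let $u = -u$). For any $(p, q)$, let $\nabla_{p,q} = p^T \nabla R(A(\lambda))\, q$
be the partial derivative of $f$ w.r.t. coordinate $(p, q)$
at $\lambda$ and denote

$$Q_{p,q}(\eta) = f(\lambda) + \eta \nabla_{p,q} + \frac{\beta \eta^2}{2} .$$

Note that the definition of $(u, v)$ and our assumption
above imply that

$$-\nabla_{u,v} = |\nabla_{u,v}| \ge (1 - \tau) \max_{p,q} |\nabla_{p,q}| ,$$

which gives

$$\nabla_{u,v} \le (\tau - 1) \max_{p,q} |\nabla_{p,q}| = (1 - \tau) \min_{p,q} \nabla_{p,q} .$$

Therefore, for all $\eta > 0$ we have

$$Q_{u,v}(\eta) \le f(\lambda) + (1 - \tau)\, \eta \min_{p,q} \nabla_{p,q} + \frac{\beta \eta^2}{2} .$$

In addition, the smoothness assumption tells us that
for all $\eta$ we have $f(\lambda + \eta e^{u,v}) \le Q_{u,v}(\eta)$. Thus, for
any $\eta > 0$ we have

$$\min_{\alpha} f(\lambda + \alpha e^{u,v}) \le f(\lambda + \eta e^{u,v}) \le Q_{u,v}(\eta) .$$

Combining the above we get

$$\min_{\alpha} f(\lambda + \alpha e^{u,v}) \le f(\lambda) + (1 - \tau)\, \eta \min_{(p,q)\in \bar{I}\setminus I} \nabla_{p,q} + \frac{\beta \eta^2}{2} .$$

Multiplying both sides by $s$ and noting that

$$s \min_{(p,q)\in \bar{I}\setminus I} \nabla_{p,q} \le \sum_{(p,q)\in \bar{I}\setminus I} \bar\lambda_{p,q} \nabla_{p,q}$$

we get that

$$s \min_{\alpha} f(\lambda + \alpha e^{u,v}) \le s f(\lambda) + (1 - \tau)\, \eta \sum_{(p,q)\in \bar{I}\setminus I} \bar\lambda_{p,q} \nabla_{p,q} + s \frac{\beta \eta^2}{2} .$$

Since $\lambda$ is a minimizer of $f$ over $I$ we have that $\nabla_{p,q} = 0$
for $(p, q) \in I$. Combining this with the fact that $\lambda$ is
supported on $I$ and $\bar\lambda$ is supported on $\bar{I}$ we obtain that

$$\sum_{(p,q)\in \bar{I}\setminus I} \bar\lambda_{p,q} \nabla_{p,q} = \langle \bar\lambda, \nabla f(\lambda) \rangle = \langle \bar\lambda - \lambda, \nabla f(\lambda) \rangle .$$

From the convexity of $f$ we know that $\langle \bar\lambda - \lambda, \nabla f(\lambda) \rangle \le
f(\bar\lambda) - f(\lambda)$. Combining all the above we obtain

$$s \min_{\alpha} f(\lambda + \alpha e^{u,v}) \le s f(\lambda) + (1 - \tau)\, \eta\, (f(\bar\lambda) - f(\lambda)) + s \frac{\beta \eta^2}{2} .$$

This holds for all $\eta > 0$ and in particular for $\eta =
(f(\lambda) - f(\bar\lambda))(1 - \tau)/(s \beta)$ (which is positive). Thus,

$$s \min_{\alpha} f(\lambda + \alpha e^{u,v}) \le s f(\lambda) - \frac{(f(\lambda) - f(\bar\lambda))^2 (1 - \tau)^2}{2 \beta s} .$$

Rearranging the above concludes our proof.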



Equipped with the above lemma we are ready to prove 
Theorem 1. 

Fix some $\bar{A}$ and let $\bar\lambda$ be the vector of its singular
values. Thus, $\|\bar\lambda\|_1 = \|\bar{A}\|_{tr}$ and $f(\bar\lambda) = R(\bar{A})$.
For each iteration $i$, denote $\epsilon_i = f(\lambda^{(i)}) - f(\bar\lambda)$,
where $\lambda^{(i)}$ is the value of $\lambda$ at the beginning of
iteration $i$ of GECO, before we increase the rank to
be $i$. Note that all the operations we perform in
GECO or one of its variants guarantee that the loss
is monotonically non-increasing. Therefore, if $\epsilon_i \le \epsilon$
we are done. In addition, whenever we increase the
rank by 1, the definition of the update implies that
$f(\lambda^{(i+1)}) \le \min_{\eta} f(\lambda^{(i)} + \eta e^{u,v})$, where $(u, v) =
\mathrm{ApproxSV}(\nabla R(A(\lambda^{(i)})), \tau)$. Lemma 3 implies that

$$\epsilon_i - \epsilon_{i+1} = f(\lambda^{(i)}) - f(\lambda^{(i+1)}) \ge \frac{\epsilon_i^2 (1 - \tau)^2}{2 \beta \|\bar{A}\|_{tr}^2} . \qquad (4)$$

Using Lemma B.2 from (Shalev-Shwartz et al., 2010),
the above implies that for $i \ge 2 \beta \|\bar{A}\|_{tr}^2 / (\epsilon (1 - \tau)^2)$ we
have that $\epsilon_i \le \epsilon$. We obtain that if $\|\bar{A}\|_{tr}^2 \le \epsilon (r +
1)(1 - \tau)^2 / (2\beta)$ then $\epsilon_{r+1} \le \epsilon$, which concludes the
proof of Theorem 1. ■



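A.2. Proof of Theorem 2

Let $\bar\lambda$ be the vector obtained from the SVD of $\bar{A}$,
that is, $\bar{A} = A(\bar\lambda)$ and $\|\bar\lambda\|_0 = \mathrm{rank}(\bar{A})$. Note that
$f$ is $\sigma$-strongly-convex over the support of $\bar\lambda$. Using
Lemma 2.2 of (Shalev-Shwartz et al., 2010) we know
that $\|\bar\lambda\|_2^2 \le \frac{2 f(0)}{\sigma}$. But, since $\|\bar{A}\|_{tr} = \|\bar\lambda\|_1$,
$\mathrm{rank}(\bar{A}) = \|\bar\lambda\|_0$, $f(0) = R(0)$, and
$\|\bar\lambda\|_1^2 \le \|\bar\lambda\|_0 \|\bar\lambda\|_2^2$ (by the Cauchy-Schwarz inequality), we get

$$\|\bar{A}\|_{tr}^2 \le \frac{2\, \mathrm{rank}(\bar{A})\, R(0)}{\sigma} .$$

The proof follows from the above using Theorem 1. ■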



